Deoxyribonucleic acid, or DNA, is the genetic material that codes for all of the diversity of life. The information in DNA is stored in four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). DNA bases pair up with each other, A with T and C with G, to form units called base pairs. Each base is attached to a sugar molecule and a phosphate molecule to make up the double helix backbone of DNA. The order, or sequence, of bases provides the information for building and maintaining individual organisms. DNA contains genes that code for proteins, which are created through the processes of transcription and translation. During transcription, DNA is translated to messenger RNA (mRNA) and during translation the sequence of mRNA is read. Each sequence of three bases codes for an amino acid and mRNA is converted to a protein one amino acid at a time. Differences in genetic sequencing is known as genetic variation.
Genetic variation can be examinined at locations in the genome under selection (adaptive loci), or locations that are not under selection (neutral loci).
The four main proccess that influence genetic variation are:
1) Mutations Mutation is the process that creates genetics variants or alleles due to errors in DNA replication.
2) Selection Selection impacts genetic variation by increasing the frequencies of alleles that are favorable in a particular environment and decreasing the frequencies of alleles that are less favorable in that environment. If a locus or a mutation is neutral, selection does not directly influence the frequency of alleles.
3) Gene Flow Gene flow occurs as a result of migration (dispersal and subsequent reproduction) and this process moves alleles between populations and tends to make them more similar genetically.
4) Genetic Drift Genetic drift is the change in allele frequencies due to randomsampling effects as alleles are passed on from one generation to the next. Genetic drift is thus influenced by the number of breeding individuals and the variance in reproductive success among breeders.
Genetic drift is directly linked to effective population size (Ne). Ne was developed to mathematically and conceptually represent the effects of genetic drift, and Ne is the population genetic analog to census size used in ecology. Ne is described as the number of breeding individuals in an idealized population that contribute genetically to the next generation. A simplified but useful approach for understanding Ne is to think about it as the number of individuals that are able to successfully pass on their genes to the next generation. Populations with larger Ne will be able to maintain higher amounts of genetic variation because more new alleles are created by mutation due to a larger number of breeding events and fewer alleles are lost to genetic drift since there is a larger sample of breeders each generation. Ne also influences the relative impacts of genetic drift and selection in natural populations . For example, as Ne decreases, the effect of selection decreases relative to the effect of genetic drift and important adaptive genetic variation can be lost.
The Two Main Components of Genetic Varation:
1) Genetic Diversity
Genetic diversity is the amount of genetic variation found within an individual, a population, or a spatial area.
2) Genetic Structure
Genetic structure is the distribution of genetic variation among individuals, populations, or areas.
Landscapes influence the amount and distribution of genetic variation in two main ways:
1) The landscape influences where an organism can live and how many individuals live in a particular location that affects the distribution and amount of genetic variation
2) The landscape influences the amount of movement and subsequent gene flow that occurs among individuals in different locations, which directly affects Ne and the amount and distribution of genetic variation
Two Main Types of DNA in animal cells:
1) Mitochondrial DNA (mtDNA)
MtDNA is a circular DNA molecule of the mitochondrion, an organelle that is the energy powerhouse of cells. MtDNA has a uniparental mode of inheritance in animals and plants as it is generally passed only from mother to offspring, but paternal transmission has been documented in some animals and conifer species.
2) Nuclear DNA (nDNA)
Nuclear DNA is the DNA of chromosomes found in the nucleus of a cell and has a biparental mode of inheritance since genetic material is inherited fromboth the mother and the father.
On average, the Ne of nDNA loci is four times higher than the Ne of cpDNA and mtDNA because organellar DNA is most often uniparentally inherited and haploid. The type of DNA, its mutation rate, and mode of inheritance influence the research questions that can be addressed and the temporal inference that can be obtained from genetic data. In general, nDNA loci are used more frequently in landscape genetic studies (90%) than mtDNA or cpDNA loci (Storfer et al. 2010) because they provide better resolution for detecting recent changes to the landscape.
Adaptive versus Neautral Loci
The two main types of nDNA molecular methods:
1)Codominant
In codominant methods like microsatellite analysis or single-nucleotide polymorphism (SNP) analysis, it is possible to visualize both alleles at a particular locus in the form of a genotype (AA versus AB, for example).
2)Dominant
While dominant loci methods, such as amplified fragment length polymorphism (AFLP), create banding patterns of tens to hundreds of bands for each individual that resemble a barcode, both alleles from a particular locus cannot be identified. Dominant loci data are generally recorded in a binary presence/ absence format, where 1 indicates presence and 0 indicates absence (0, 1, 0, 0, 0, 1, for example).
Mitochondrial DNA and cpDNA are generally analyzed using DNA sequencing or restriction fragment length polymorphism (RFLP), approaches that generate haplotype data rather than genotype data since the loci are haploid.
Two main types of analysis for genetic data in landscape genetics:
1) Individuals
Individual is chosen as unit of analysis when the species is more continuously distributed across a landscape and there is no population boundaries.
2) Populations
Population is chosen as the unit of analysis for species that are patchily distributed
This rule states that for a diploid organism with two alleles at one locus, if p represents the frequency of one allele and q represents the frequency of the other allele, then genotype frequencies will be p2 + 2pq +q2 =1, whereby p2 represents the pp genotype or the p homozygote frequency, 2pq represents the heterozygote frequency, and q2 represents the q homozygote frequency.
These allele frequencies are expected under Hardy–Weinberg equilibrium (HWE) in one generation of random mating in a population, that:
1) is infinite in size
2) has no selection
3) has no mutation
4) has no migration
5) has random mating
Under Mendel’s law of independent assortment, it is assumed that the cross-generational transmission of alleles at any particular locus is independent of alleles at other loci and that the fitnesses of possible pairs of alleles are decoupled. It follows, then, that linkage disequilibrium exists when there is a statistical (non-random) association between alleles at different loci. These alleles can be physically linked on chromosomes, whereby they are in near proximity to one another and those inherited together. Alternatively, other evolutionary processes such as genetic drift in small populations can create linkage disequilibrium among pairs of alleles. Moreover, selection can cause genetic “hitchhiking”,where by alleles of little-to-no effect on fitness are inherited together with alleles favored under selection.
Populations are not infinite in size and Ne is almost always smaller than the census population size or the number of individuals estimated from field or other surveys. Unequal sex ratio, assortative mating, and overlapping generations are all examples ofdemographic factors that make effective population sizes smaller than census population sizes.
There are two commonly estimated measures of effective population size:
1) Variance effective size (NeV) is the size of an ideal population experiencing drift at the same rate as the actual population.
2) Inbreeding effective size (NeI) is the size of an ideal population losing heterozygosity, due to increased relatedness, at the same rate as the actual population.
For stable, large populations NeI and NeV are similar, but when populations are growing or shrinking they can differ greatly.
Genetic drift results in random changes in allele frequencies across generations resulting from sampling of gametes, and thus has larger effects on smaller populations than larger populations. In diploid organisms, heterozygosity (the proportion of individuals with 2 different alleles at a locus) is lost at a rate of 1/(2Ne) at each generation due to genetic drift. As such, the reduction of genetic diversity due to drift has become a management concern for small populations or populations of endangered species because maintenance of genetic diversity is recognized as important for the evolutionary potential of populations to respond to future environmental change.
Extreme and rapid declines of (effective) population size result in population bottlenecks and thereby often rapid reduction in genetic diversity resulting from drift.
Mutation is the ultimate source of genetic variation in populations and species. Mutations occur when mistakes are made during DNA replication and can take two main forms:
1) Point mutations
Point mutations are changes of a single nucleotide, such as A to G or G to C. As much of the genome is non-coding, point mutations are often considered “silent” as they do not result in changes in protein structure. Even when point mutations do occur in segments of DNA that code for amino acids (i.e., exons), they often do not result in changes of amino acid sequences (and are called synonymous substitutions) due to the conservation of the genetic code (i.e., 64 possible codons code for only 20 amino acids). Point mutations that do change amino acid sequences are called non-synonymous substitutions.
2) Insertions/Deletions of segments of DNA
Insertions or deletions change the length of a strand of DNA and are hence called frameshift mutations when they occur in exons because they shift the reading frame that is transcribed, often resulting in major changes in amino acid sequences or even premature termination of protein formation.
Gene flow is defined as the movement of genes (via successful mating) among populations. High rates of gene flow tend to homogenize populations, thereby resulting in similar allele frequencies. Conversely, low rates of gene flow tend to result in population isolation and genetic differentiation. In turn, if populations have a small Ne, genetic drift will act to differentiate these populations genetically via random changes in allele frequencies. However, it only takes small amounts of migration (about 1 effective migrant per generation in theory) to overcome drift, which is a relatively weak evolutionary force.
The concept of isolation-by distance was first envisioned by Sewall Wright (1943, 1951) and assumes that genetic distance among populations is positively correlated with geographic distance among those populations.
When the population is the unit of analysis, there are four main summary metrics for codominant nDNA loci:
1) number of alleles
2) allelic richness
3) heterozyosity
4) proportion of polymorphic loci
There are many metrics used to estimate individual heterozygosity using codominant loci:
Heterozygosity (H)
Standardized heterozygosity (SH)
Internal relatedness (IR)
Homozygosity by loci (HL)
One of the key components of a landscape genetic analysis is evaluating genetic structure (i.e., differentiation) among populations and individuals. When populations are panmictic, or completely randomly mating, there are no discernable differences in allele frequencies and estimates of gene flow become infinite.
The oldest and most widely used population-level metric of genetic structure is Wright’s FST, one of three F statistics derived by Sewall Wright. F-statistics were developed to describe the partitioning of genetic variation within a species. FST is a measure of subpopulation-level genetic differentiation relative to the total population that ranges from 0 (panmixia)to 1 (complete genetic isolation) and measures allele frequency divergence among sub populations.
Adaptive loci: A locus where the frequency of alleles is affected by selection.
Neutral Loci: A locus where the frequency of alleles is not affected by selection.
Mutation: A change in the DNA sequence of an organism. Mutations that occur in the DNA of germ cells can be passed on to the next generation.
Alleles: A unique genetic variant observed at a particular locus.
Locus (Loci): A locus is a particular location in the genome of an organism. The term loci is the plural form of locus.
Effective Population Size (Ne): The number of breeding individuals in an idealized population that would show the same amount of change in allele frequencies under random genetic drift or the same amount of inbreeding as the population under consideration.
Mitochondiral DNA (mtDNA): A circular DNA molecule of the mitochondrion which is haploid and generally passed on only from mother to offspring.
Nuclear DNA: The DNA of the chromosomes found in the nucleus of a cell.
Chloroplast DNA (cpDNA): A circular DNA molecule found in the chloroplast of plants.
Haploid: An organism or a cell that contains only one set of chromosomes in its genome.
Genotype: The combination of alleles observed at a particular locus for a specific individual.
Haplotype: An allele or combination of alleles passed on from a single parent.
Homozygote: In a diploid organism, an individual that has two copies of the same allele on homologous chromosomes.
Heterozygote: In a diploid organism, an individual that has two different alleles at each of its homologous chromosomes.
Linkage Disequilibrium: Non-random association of alleles at two or more loci; these alleles tend to be inherited together significantly more often than expected by random chance.
Inbreeding: Mating between closely related individuals.
Bottlenecks: When a population goes through a period where its effective population size is extremely small, resulting in an increase in the effects of genetic drift and a consequent loss of genetic variation.
Synonymous Substitutions: Mutations in coding DNA that do not result in changes in amino acid sequence. Often called “silent substitutions”.
Non-Synonymous Substitutions: Mutations in coding DNA that result in changes in amino acid sequence.
Demes: A deme is a group of individuals that is sufficiently genetically isolated from other groups of individuals and can also be considered a population.
Panmixia: Random mating of individuals within a population or geographic area.
Assignment Tests: Methods that use genotypic information to evaluate population membership of sampled individuals.
Coalescent: A body of theory that investigates time of divergence from a common ancestor. In population genetics, this theory can be applied to understanding differences in allele frequencies among populations.
Likelihood-Based: A quantity that describes the probability of the data in a particular statistical model conditional on the model parameters.
This worked example shows how to:
Check markers and populations (polymorphism, HWE, linkage, null alleles).
Assess genetic diversity.
Aggregate genetic data at the population level.
Microsatellite data for 181 individuals of Colombia spotted frogs (Rana luteiventris) from 12 populations. Site-level spatial coordinates and attributes. The data are a subsample of the full data set analyzed in Funk et al. (2005) and Murphy et al. (2010). Please see the separate introduction to the data set.
ralu.loci: Data frame with populations and genetic data (181 rows x 9 columns). Included in package ‘LandGenCourse.’ To load it, type: data(ralu.loci)
ralu.site: Spatial points data frame with spatial coordinates and site variables Included in package GeNetIt’. To load it, type: data(ralu.site)
All required packages should have been installed already when you installed ‘LandGenCourse.’
require(adegenet)
## Loading required package: adegenet
## Loading required package: ade4
##
## /// adegenet 2.1.5 is loaded ////////////
##
## > overview: '?adegenet'
## > tutorials/doc/questions: 'adegenetWeb()'
## > bug reports/feature requests: adegenetIssues()
require(LandGenCourse)
## Loading required package: LandGenCourse
require(pegas)
## Loading required package: pegas
## Loading required package: ape
## Registered S3 method overwritten by 'pegas':
## method from
## print.amova ade4
##
## Attaching package: 'pegas'
## The following object is masked from 'package:ape':
##
## mst
## The following object is masked from 'package:ade4':
##
## amova
require(sp)
## Loading required package: sp
require(PopGenReport)
## Loading required package: PopGenReport
## Loading required package: knitr
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Registered S3 method overwritten by 'genetics':
## method from
## [.haplotype pegas
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(poppr)
## Loading required package: poppr
## This is poppr version 2.9.3. To get started, type package?poppr
## OMP parallel support: unavailable
Before we do landscape genetic analysis, we need to perform a basic population genetic analysis of the genetic data, in order to better understand the nature and quality of the data and to check for underlying assumptions of population genetic models and corresponding methods.
Adapted from Week 1 tutorial:
Note: we use the double colon notation ‘package::function(argument)’ to indicate, for each function, which package it belongs to.
data(ralu.loci, package="LandGenCourse")
Frogs <- data.frame(FrogID = paste(substr(ralu.loci$Pop, 1, 3),
row.names(ralu.loci), sep="."), ralu.loci)
Frogs.genind <- adegenet::df2genind(X=Frogs[,c(4:11)], sep=":", ncode=NULL,
ind.names= Frogs$FrogID, loc.names=NULL,
pop=Frogs$Pop, NA.char="NA", ploidy=2,
type="codom", strata=NULL, hierarchy=NULL)
Frogs.genind
## /// GENIND OBJECT /////////
##
## // 181 individuals; 8 loci; 39 alleles; size: 55.5 Kb
##
## // Basic content
## @tab: 181 x 39 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 3-9)
## @loc.fac: locus factor for the 39 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: adegenet::df2genind(X = Frogs[, c(4:11)], sep = ":", ncode = NULL,
## ind.names = Frogs$FrogID, loc.names = NULL, pop = Frogs$Pop,
## NA.char = "NA", ploidy = 2, type = "codom", strata = NULL,
## hierarchy = NULL)
##
## // Optional content
## @pop: population of each individual (group size range: 7-23)
The genetic resolution depends on the number of markers and their polymorphism. The table above and the summary function for genind objects together provide this information. Now we run the summary function:
summary(Frogs.genind)
##
## // Number of individuals: 181
## // Group sizes: 21 8 14 13 7 17 9 20 19 13 17 23
## // Number of alleles per locus: 3 4 4 4 9 3 4 8
## // Number of alleles per group: 21 21 20 22 20 19 19 25 18 14 18 26
## // Percentage of missing data: 10.64 %
## // Observed heterozygosity: 0.1 0.4 0.09 0.36 0.68 0.02 0.38 0.68
## // Expected heterozygosity: 0.17 0.47 0.14 0.59 0.78 0.02 0.48 0.74
The output of the summary function shows us the following:
8 loci with 3 - 9 alleles (39 in total)
Expected heterozygosity varies between 0.14 (locus C) and 0.78 (locus E)
There’s a reasonable level of missing values (10.6%)
See also: http://dyerlab.github.io/applied_population_genetics/hardy-weinberg-equilibrium.html
For a very large population (no drift) with random mating and non-overlapping generations (plus a few more assumptions about the mating system), and in the absence of mutation, migration (gene flow) and selection, we can predict offspring genotype frequencies from allele frequencies of the parent generation (Hardy-Weinberg equilibrium). In general, we don’t expect all of these assumptions to be met (e.g., if we want to study gene flow or selection, we kind of expect that these processes are present). Note: plants often show higher levels of departure from HWE than animals.
Here are p-values for two alternative tests of deviation from HWE for each locus. Columns:
chi^2: value of the classical chi-squared test statistic
df: degrees of freedom of the chi-squared test
Pr(chi^2 >): p-value of the chi-squared test (‘>’ indicates that the alternative is ‘greater,’ which is always the case for a chi-squared test)
Pr.exact: p-value from an exact test based on Monte Carlo permutation of alleles (for diploids only). The default is B = 1000 permutations (set B = 0 to skip this test). Here we use the function ‘round’ with argument ‘digits = 3’ to round all values to 3 decimals.
round(pegas::hw.test(Frogs.genind, B = 1000), digits = 3)
## chi^2 df Pr(chi^2 >) Pr.exact
## A 40.462 3 0.000 0.000
## B 17.135 6 0.009 0.041
## C 136.522 6 0.000 0.000
## D 83.338 6 0.000 0.000
## E 226.803 36 0.000 0.000
## F 0.024 3 0.999 1.000
## G 12.349 6 0.055 0.007
## H 76.813 28 0.000 0.000
Both tests suggest that all loci except for locus “F” are out of HWE globally (across all 181 individuals). Next, we check for HWE of each locus in each population.
Notes on the code: The curly brackets ‘{ }’ below are used to keep the output from multiple lines together in the html file. Function ‘seppop’ splits the genind object by population. We use ‘sapply’ to apply the function ‘hw.test’ from package ‘pegas’ to each population (see this week’s video and tutorial). We set ‘B=0’ to specify that we don’t need any permutations right now. The function ‘t’ takes the transpose of the resulting matrix, which means it flips rows and columns. This works on a matrix, not a data frame, hence we use ‘data.matrix’ to temporarily interpret the data frame as a matrix.
# Chi-squared test: p-value
HWE.test <- data.frame(sapply(seppop(Frogs.genind),
function(ls) pegas::hw.test(ls, B=0)[,3]))
HWE.test.chisq <- t(data.matrix(HWE.test))
{cat("Chi-squared test (p-values):", "\n")
round(HWE.test.chisq,3)}
## Chi-squared test (p-values):
## A B C D E F G H
## Airplane 0.092 0.359 1.000 0.427 0.680 1.000 0.178 0.051
## Bachelor 1.000 0.557 0.576 0.686 0.716 1.000 0.414 0.609
## BarkingFox 0.890 0.136 0.005 0.533 0.739 0.890 0.708 0.157
## Bob 0.764 0.864 0.362 0.764 0.033 1.000 0.860 0.287
## Cache 1.000 0.325 0.046 0.659 0.753 1.000 0.709 0.402
## Egg 1.000 0.812 1.000 1.000 0.156 1.000 0.477 0.470
## Frog 1.000 0.719 0.070 0.722 0.587 1.000 0.564 0.172
## GentianL 0.809 0.059 1.000 0.028 0.560 0.717 0.474 0.108
## ParagonL 1.000 0.054 0.885 0.709 0.868 1.000 0.291 0.000
## Pothole 1.000 1.000 1.000 0.488 0.248 1.000 0.296 0.850
## ShipIsland 0.807 0.497 1.000 0.521 0.006 1.000 0.498 0.403
## Skyhigh 0.915 0.493 0.063 0.001 0.155 1.000 0.126 0.078
Let’s repeat this with a Monte Carlo permutation test with B = 1000 replicates:
# Monte Carlo: p-value
HWE.test <- data.frame(sapply(seppop(Frogs.genind),
function(ls) pegas::hw.test(ls, B=1000)[,4]))
HWE.test.MC <- t(data.matrix(HWE.test))
{cat("MC permuation test (p-values):", "\n")
round(HWE.test.MC,3)}
## MC permuation test (p-values):
## A B C D E F G H
## Airplane 0.015 1.000 1.000 0.415 0.623 1 0.246 0.008
## Bachelor 1.000 0.460 1.000 1.000 0.855 1 0.499 0.603
## BarkingFox 1.000 0.221 0.067 1.000 0.753 1 1.000 0.162
## Bob 1.000 1.000 1.000 1.000 0.019 1 1.000 0.278
## Cache 1.000 0.367 0.131 1.000 1.000 1 1.000 0.611
## Egg 1.000 1.000 1.000 1.000 0.087 1 0.548 0.475
## Frog 1.000 1.000 0.082 1.000 0.464 1 1.000 0.171
## GentianL 1.000 0.065 1.000 0.077 0.649 1 0.621 0.142
## ParagonL 1.000 0.152 1.000 1.000 1.000 1 0.344 0.087
## Pothole 1.000 1.000 1.000 1.000 0.547 1 0.543 1.000
## ShipIsland 1.000 0.611 1.000 0.697 0.141 1 0.510 0.422
## Skyhigh 1.000 0.357 0.182 0.088 0.107 1 0.070 0.032
To summarize, let’s calculate, for each locus, the proportion of populations where it was out of HWE. Here we’ll use the conservative cut-off of alpha = 0.05 for each test. There are various ways of modifying this, including a simple Bonferroni correction, where we divide alpha by the number of tests, which you can activate here by removing the ## i. front of the line.
We write the results into a data frame ‘Prop.loci.out.of.HWE’ and use ‘=’ to specify the name for each column.
alpha=0.05
Prop.loci.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 2, mean),
MC=apply(HWE.test.MC<alpha, 2, mean))
Prop.loci.out.of.HWE # Type this line again to see results table
## Chisq MC
## A 0.00000000 0.08333333
## B 0.00000000 0.00000000
## C 0.16666667 0.00000000
## D 0.16666667 0.00000000
## E 0.16666667 0.08333333
## F 0.00000000 0.00000000
## G 0.00000000 0.00000000
## H 0.08333333 0.16666667
And similarly, for each population, the proportion of loci that were out of HWE:
Prop.pops.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 1, mean),
MC=apply(HWE.test.MC<alpha, 1, mean))
Prop.pops.out.of.HWE
## Chisq MC
## Airplane 0.000 0.250
## Bachelor 0.000 0.000
## BarkingFox 0.125 0.000
## Bob 0.125 0.125
## Cache 0.125 0.000
## Egg 0.000 0.000
## Frog 0.000 0.000
## GentianL 0.125 0.000
## ParagonL 0.125 0.000
## Pothole 0.000 0.000
## ShipIsland 0.125 0.000
## Skyhigh 0.125 0.125
The results suggest that:
While most loci are out of HWE globally, this is largely explained by subdivision (variation in allele frequencies among local populations indicating limited gene flow).
No locus is consistently out of HWE across populations (loci probably not affected by selection).
No population is consistently out of HWE across loci (probably no recent major bottlenecks/ founder effects).
Let’s repeat this with ‘false discovery rate’ correction for the number of tests. Here we use the function ‘p.adjust’ with the argument ‘method=“fdr”’ to adjust the p-values from the previous tests. This returns a vector of length 96 (the number of p-values used), which we convert back into a matrix of 12 rows (pops) by 8 columns (loci). Then we proceed as above.
Chisq.fdr <- matrix(p.adjust(HWE.test.chisq,method="fdr"),
nrow=nrow(HWE.test.chisq))
MC.fdr <- matrix(p.adjust(HWE.test.MC, method="fdr"),
nrow=nrow(HWE.test.MC))
Prop.pops.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 1, mean),
MC=apply(HWE.test.MC<alpha, 1, mean),
Chisq.fdr=apply(Chisq.fdr<alpha, 1, mean),
MC.fdr=apply(MC.fdr<alpha, 1, mean))
Prop.pops.out.of.HWE
## Chisq MC Chisq.fdr MC.fdr
## Airplane 0.000 0.250 0.000 0
## Bachelor 0.000 0.000 0.000 0
## BarkingFox 0.125 0.000 0.000 0
## Bob 0.125 0.125 0.000 0
## Cache 0.125 0.000 0.000 0
## Egg 0.000 0.000 0.000 0
## Frog 0.000 0.000 0.000 0
## GentianL 0.125 0.000 0.000 0
## ParagonL 0.125 0.000 0.125 0
## Pothole 0.000 0.000 0.000 0
## ShipIsland 0.125 0.000 0.000 0
## Skyhigh 0.125 0.125 0.125 0
After using false discovery rate correction for the 8 * 12 = 96 tests performed, very few combinations of locus and population were out of HWE based on the chi-squared test, and none with the MC test.
Note: exact results are likely to differ somewhat between runs due to the permutation tests.
See also: https://grunwaldlab.github.io/Population_Genetics_in_R/Linkage_disequilibrium.html
For microsatellite markers, we typically don’t know where on the genome they are located. The closer together two markers are on a chromosome, the more likely they are inherited together, which means that they don’t really provide independent information. Testing for linkage disequilibrium assesses this, for each pair of loci, by checking whether alleles of two loci are statistically associated.
This step is especially important when developing a new set of markers. You may want to drop (the less informative) one marker of any pair of linked loci.
Here, we start with performing an overall test of linkage disequilibrium (the null hypothesis is that there is no linkage among the set of markers). Two indices are calculated and tested: an index of association (Ia; Brown et al. 1980) and a measure of correlation (rbarD; Agapow and Burt 2001), which is less biased (see URL above). The number of permutations is specified by ‘sample = 199.’
Overall, there is statistically significant association among the markers (p-value: prD = 0.005; also left figure). Recall that the power of a statistical test increases with sample size, and here we have n = 181, hence even a small effect may be statistically significant. Hence we look at effect size, i.e., the actual strength of the pairwise associations (right figure).
poppr::ia(Frogs.genind, sample=199)
## Ia p.Ia rbarD p.rD
## 0.33744318 0.00500000 0.05366542 0.00500000
LD.pair <- poppr::pair.ia(Frogs.genind)
LD.pair
## Ia rbarD
## A:B 0.0485 0.0492
## A:C -0.0314 -0.0335
## A:D 0.1886 0.1966
## A:E 0.0560 0.0569
## A:F -0.0272 -0.0452
## A:G 0.0931 0.0935
## A:H 0.0294 0.0304
## B:C -0.0329 -0.0375
## B:D 0.0903 0.0911
## B:E 0.0910 0.0910
## B:F -0.0013 -0.0025
## B:G 0.0451 0.0452
## B:H 0.0621 0.0623
## C:D -0.0859 -0.1049
## C:E 0.0247 0.0284
## C:F -0.0311 -0.0397
## C:G -0.0107 -0.0118
## C:H 0.0012 0.0015
## D:E 0.0455 0.0458
## D:F 0.0094 0.0199
## D:G 0.0069 0.0070
## D:H 0.0461 0.0462
## E:F 0.0013 0.0025
## E:G 0.0453 0.0454
## E:H 0.2153 0.2159
## F:G 0.0167 0.0299
## F:H 0.0296 0.0606
## G:H 0.0942 0.0953
The strongest correlation is around 0.2, for markers E and H.
Effect size: If rbarD can be interpreted similarly to a linear correlation coefficient r, that would mean that less than 5% of the variation in one marker is shared with the other marker (recall from stats: the amount of variance explained in regression, Rsquared, is the square of the linear correlation coefficient). This is probably not large enough to worry about.
See also Dakin and Avise (2004): http://www.nature.com/articles/6800545
One potential drawback for microsatellites as molecular markers is the presence of null alleles that fail to amplify, thus they couldn’t be detected in the PCR assays.
The function ‘null.all’ takes a genind object and returns a list with two components (‘homozygotes’ and ‘null.allele.freq’), and each of these is again a list. See ‘?null.all’ for details and choice of method.
List ‘homozygotes’:
homozygotes$observed: observed number of homozygotes for each allele at each locus
homozygotes$bootstrap: distribution of the expected number of homozygotes
homozygotes$probability.obs: probability of observing the number of homozygotes
Note: we are turning off warnings here (currently the code throws a warning for each sample, though results don’t seem to be affected). Code takes a moment to run.
# Null alleles: depends on method! See help file.
Null.alleles <- PopGenReport::null.all(Frogs.genind)
Null.alleles$homozygotes$probability.obs
## Allele-1 Allele-2 Allele-3 Allele-4 Allele-5 Allele-6 Allele-7 Allele-8
## A 0.098 0.000 0.027 NA NA NA NA NA
## B 0.151 0.045 0.072 0.003 NA NA NA NA
## C 0.182 0.003 0.000 0.005 NA NA NA NA
## D 0.000 0.000 0.196 0.004 NA NA NA NA
## E 0.055 0.013 0.020 0.266 0.064 0.043 0.742 0.531
## F 0.466 0.001 0.020 NA NA NA NA NA
## G 0.074 0.087 0.016 0.001 NA NA NA NA
## H 0.452 0.119 0.009 0.023 0.260 0.307 0.007 0.002
## Allele-9
## A NA
## B NA
## C NA
## D NA
## E 0
## F NA
## G NA
## H NA
List ‘null.allele.freq’:
null.allele.freq$summary1: null allele frequency estimates based upon the forumulas of Chakraborty et al. (1994)
null.allele.freq$summary2: null allele frequency estimates based upon the forumulas of Brookfield (1996)
From the help file: “Brookfield (1996) provides a brief discussion on which estimator should be used. In summary, it was recommended that Chakraborty et al. (1994)’s method (e.g. summary1) be used if there are individuals with no bands at a locus seen, but they are discounted as possible artefacts. If all individuals have one or more bands at a locus then Brookfield (1996)’s method (e.g. summary2) should be used.” In this case, we have many individuals with missing values for both alleles, hence better use summary1.
Each summary table contains a summary with observed, median, 2.5th percentile and 97.5the percentile. The percentiles form a 95% confidence interval. From the help file: “If the 95% confidence interval includes zero, it indicates that the frequency of null alleles at a locus does not significantly differ from zero.”
{cat(" summary1 (Chakraborty et al. 1994):", "\n")
round(Null.alleles$null.allele.freq$summary1,2)}
## summary1 (Chakraborty et al. 1994):
## A B C D E F G H
## Observed frequency 0.24 0.08 0.23 0.25 0.06 0.00 0.11 0.04
## Median frequency 0.24 0.08 0.22 0.25 0.06 0.00 0.11 0.04
## 2.5th percentile 0.08 0.02 0.03 0.17 0.02 -0.01 0.01 0.00
## 97.5th percentile 0.42 0.16 0.48 0.33 0.11 0.00 0.22 0.09
{cat("summary2 (Brookfield et al. 1996):", "\n")
round(Null.alleles$null.allele.freq$summary2,2)}
## summary2 (Brookfield et al. 1996):
## A B C D E F G H
## Observed frequency 0.06 0.05 0.05 0.17 0.05 0 0.07 0.04
## Median frequency 0.06 0.05 0.05 0.17 0.05 0 0.07 0.03
## 2.5th percentile 0.02 0.01 0.01 0.12 0.02 0 0.01 0.00
## 97.5th percentile 0.10 0.10 0.10 0.23 0.10 0 0.13 0.08
For this example, both methods suggest that there may be null alleles in most loci. However, the estimates of the frequency of null alleles differ a lot between the two methods.
A different approach for estimating null alleles at microsatellite loci, based on the Estimation-Maximization algorithm, is implemented in FreeNA (outside of the R environment). FreeNA will directly provide Fst values and some other measurements using the corrected allele frequencies: https://www1.montpellier.inra.fr/CBGP/software/FreeNA/
Relevant papers for the Estimation-Maximization algorithm:
Kalinowski et al. (2006), Conservation Genetics 7:991–995, doi: 10.1007/s10592-006-9134-9.
Chapuis and Estoup (2007), Mol. Biol. Evol. 24:621–631, doi: 10.1093/molbev/msl191.
Spatial genetic structure: The Columbia spotted frog data used in this lab come from a study area with a great deal of genetic structure. If we use a population assignment test (see Week 9), each basin is a separate unit with significant substructure within basins. Testing for significant genetic distance, most pairs of ponds have a genetic distance that is significantly different from zero.
HWE: Therefore, we expect global estimates (i.e., the whole dataset) of He to be out of HWE due to population substructure (HWE assumes panmictic populations). We would also expect data to be out of HWE when analyzing data by basin due to population substructure. We could, then, test HWE and linkage disequilibrium at the pond level (as shown here). However, some ponds have low sample sizes (which refleect a low number of individuals: based on repeated surveys of sites, most if not nearly all animals were captured).
Linkage: These low samples sizes can result in deviations from HWE and in linkage disequilibrium as an artifact of sample size and/or breeding structure (if one male is responsible for all breeding, his genes would appear “linked” in offspring genotypes, even though they are not physically linked in the genome). In addition, each pond may not represent a “population,” but only a portion of a population.
So, what do we do? We can look for patterns. Are there loci that are consistently out of HWE across samples sites while other loci are not out of HWE suggesting that there are null alleles or other data quality control issues? With the full data set, this was not the case. Are there loci that are consistently linked across different ponds (while others are not), suggesting that they are linked? With the full dataset, this was not the case.
Null alleles: Missing data can imply null (non-amplifying) alleles. In this case, while there are loci that have higher “drop-out” rates than others, this is more likely due to some loci being more difficult to call (the authors were very strict in removing any questionable data; equivocal calls resulting in no data). In addition, some of the samples used in this study were toe clips which with low yields of DNA, resulting in incomplete genotypes. While presence of null alleles is a possibility, genetic structure, breeding patterns at low population size and aggressive quality control of genotypes can all explain the results. Finally, with all of the structure in these data, there are examples (at the basin level and pond level) of unique alleles that are fixed or nearly fixed. When the data are assessed globally, this will result in a similar pattern to the presence of null alleles and will result in positive null allele tests.
These measures are typically quantified per population.
Both nominal sample size (number of frogs sampled) and valid sample size (e.g., for each locus, the number of frogs with non-missing genetic data) vary between sites. We would expect the number of alleles found in a population to increase with the number of individuals genotyped.
We can check this by plotting the number of alleles against sample size. Here we create an object ‘Sum’ that contains the summary of the genind object, then we can access its elements by ‘$’ to plot what we need. The function ‘names’ lists the names of the elements, which reduced the guesswork.
Sum <- summary(Frogs.genind)
names(Sum)
## [1] "n" "n.by.pop" "loc.n.all" "pop.n.all" "NA.perc" "Hobs"
## [7] "Hexp"
The site names are quite long, hence we print the labels vertically by setting ‘las=3,’ and we modify the margins (‘mar’). The four numbers give the size of each margin in the following order: bottom, left, top, right.
We add a regression line to the scatterplot with the function ‘abline,’ where we specify the linear regression model with the function ‘lm.’ In this case, we model the response ‘pop.n.all’ as a function of predictor ‘n.by.pop.’
The barchart (left) shows that there is considerable variation among ponds in the number of alleles observed across all loci. The scatterplot (right) with the red regression line shows that the number of alleles increases with sample size.
par(mar=c(5.5, 4.5,1,1))
barplot(Sum$pop.n.all, las=3,
xlab = "", ylab = "Number of alleles")
plot(Sum$n.by.pop, Sum$pop.n.all,
xlab = "Sample size", ylab = "Number of alleles")
abline(lm(Sum$pop.n.all ~ Sum$n.by.pop), col = "red")
Hence we should not compare the number of alleles directly. Instead, we’ll use rarefied allelic richness (Ar).
By default, the function ‘allel.rich’ finds the lowest valid sample size across all populations and loci, and multiplies it by the ploidy level. The number is stored as ‘Richness$alleles.sampled’ (here: 3 individuals * 2 alleles = 6 alleles). Alternatively, this number can be set with the ‘min.alleles’ argument.
Richness <- PopGenReport::allel.rich(Frogs.genind, min.alleles = NULL)
Richness$alleles.sampled
## [1] 6
Populations with more alleles are resampled to determine the average allelic richness among the minimum number of alleles. Here, this means that 6 alleles are sampled from each population, allelic richness is calculated, and the process is repeated many times to determine the average).
Let’s plot the results again. The barchart shows that there is considerable variation in genetic diversity among ponds. The scatterplot against sample size (here: for each population, the average number of valid alleles across loci) suggests that the variation is not related to sample size. The regression line (red) is almost horizontal.
Here we plot the average Ar across loci, so that the result does not depend on the number of loci used.
par(mar=c(5.5, 4.5,1,1))
barplot(Richness$mean.richness, las=3, ylab="Rarefied allelic richness (Ar)")
plot(colMeans(Richness$pop.sizes), Richness$mean.richness,
xlab="Valid sample size",
ylab="Rarefied allelic richness (Ar)")
abline(lm(Richness$mean.richness ~ colMeans(Richness$pop.sizes)), col="red")
Note: Writing the ‘genind’ summary into an object ‘Sum’ allows accessing its attributes by name.
Sum <- summary(Frogs.genind)
names(Sum)
## [1] "n" "n.by.pop" "loc.n.all" "pop.n.all" "NA.perc" "Hobs"
## [7] "Hexp"
Expected heterozygosity (here: Hexp) is the heterozygosity expected in a population under HWE, and observed heterozygosity (here: Hobs) is the observed number of heterozygotes at a locus divided by the total number of genotyped individuals. Here are the global values (pooled across all populations):
par(mar=c(3, 4.5,1,1))
barplot(Sum$Hexp, ylim=c(0,1), ylab="Expected heterozygosity")
barplot(Sum$Hobs, ylim=c(0,1), ylab="Observed heterozygosity")
By locus and population:
Here we use ‘seppop’ to split the genind object by population, then ‘sapply’ to apply function ‘summary’ to each population.
Hobs <- t(sapply(seppop(Frogs.genind), function(ls) summary(ls)$Hobs))
Hexp <- t(sapply(seppop(Frogs.genind), function(ls) summary(ls)$Hexp))
{cat("Expected heterozygosity (Hexp):", "\n")
round(Hexp, 2)
cat("\n", "Observed heterozygosity (Hobs):", "\n")
round(Hobs, 2)}
## Expected heterozygosity (Hexp):
##
## Observed heterozygosity (Hobs):
## A B C D E F G H
## Airplane 0.25 0.33 0.00 0.33 0.62 0.00 0.16 0.74
## Bachelor 0.00 0.38 0.40 0.25 0.88 0.00 0.33 0.75
## BarkingFox 0.07 0.14 0.00 0.57 0.79 0.07 0.22 0.23
## Bob 0.15 0.38 0.42 0.15 0.92 0.00 0.11 0.67
## Cache 0.00 0.83 0.00 0.29 1.00 0.00 0.40 0.86
## Egg 0.00 0.41 0.00 0.00 0.87 0.00 0.36 0.87
## Frog 0.00 0.38 0.14 0.56 0.86 0.00 0.33 1.00
## GentianL 0.11 0.95 0.00 0.74 0.90 0.15 0.56 0.84
## ParagonL 0.00 0.11 0.08 0.16 0.33 0.00 0.36 0.47
## Pothole 0.00 0.00 0.00 0.33 0.50 0.00 0.57 0.73
## ShipIsland 0.41 0.41 0.00 0.59 0.53 0.00 0.36 0.71
## Skyhigh 0.04 0.57 0.11 0.30 0.52 0.00 0.77 0.59
Locus F shows variation only in two populations (i.e., Hexp = 0 in 10 populations).
Let’s plot the average across all loci for each population:
Here we use ‘apply’ to apply the function ‘mean’ to the rows (MARGIN = 1). For columns, use ‘MARGIN = 2.’
par(mar=c(5.5, 4.5, 1, 1))
Hobs.pop <- apply(Hobs, MARGIN = 1, FUN = mean)
Hexp.pop <- apply(Hexp, 1, mean)
barplot(Hexp.pop, ylim=c(0,1), las=3, ylab="Expected heterozygosity")
barplot(Hobs.pop, ylim=c(0,1), las=3, ylab="Observed heterozygosity")
Frogs.diversity <- data.frame(Pop = names(Hobs.pop),
n = Sum$n.by.pop,
Hobs = Hobs.pop,
Hexp = Hexp.pop,
Ar = Richness$mean.richness)
Frogs.diversity
## Pop n Hobs Hexp Ar
## Airplane Airplane 21 0.3038064 0.3433019 1.939629
## Bachelor Bachelor 8 0.3729167 0.3651953 2.003673
## BarkingFox BarkingFox 14 0.2619811 0.2818609 1.741904
## Bob Bob 13 0.3504274 0.3171011 1.961791
## Cache Cache 7 0.4220238 0.3923413 2.085115
## Egg Egg 17 0.3135918 0.2951215 1.841382
## Frog Frog 9 0.4079861 0.3659439 1.925469
## GentianL GentianL 20 0.5296418 0.4529000 2.331599
## ParagonL ParagonL 19 0.1880302 0.2200536 1.539923
## Pothole Pothole 13 0.2665043 0.2328137 1.597687
## ShipIsland ShipIsland 17 0.3755252 0.3918983 1.949255
## Skyhigh Skyhigh 23 0.3632542 0.3552167 1.953610
You can save the R object ‘Frogs.diversity’ with the code below (need to uncomment by removing the hashtags ‘#’):
#require(here)
#if(!dir.exists(paste0(here(),"/output"))) dir.create(paste0(here(),"/output"))
#save(Frogs.diversity, file = paste0(here(),"/output/Frogs.diversity.RData"))
#load(paste0(here(),"/output/Frogs.diversity.RData"))
For some analyses, we will need to aggregate data from the individual to the population level, e.g. as a table of allele frequencies per population.
Here we convert the ‘genind’ object to a ‘genpop’ object (NOT the same as a ‘genepop’ object!). This is defined in the package ‘adegenet’ to hold population-level genetic data. The function ‘genind2genpop’ obviously converts from ‘genind’ to ‘genpop.’
Frogs.genpop <- adegenet::genind2genpop(Frogs.genind)
##
## Converting data from a genind to a genpop object...
##
## ...done.
The function ‘makefreq’ extracts the table with allele frequencies from the ‘genpop’ object. We’ll plot just a few lines and alleles.
Freq <- adegenet::makefreq(Frogs.genpop)
##
## Finding allelic frequencies from a genpop object...
##
## ...done.
round(Freq[1:6,1:10], 2)
## A.1 A.2 A.3 B.1 B.3 B.2 B.4 C.1 C.2 C.4
## Airplane 0.62 0.35 0.03 0.83 0.17 0.00 0.00 1.00 0.00 0
## Bachelor 1.00 0.00 0.00 0.56 0.06 0.38 0.00 0.80 0.20 0
## BarkingFox 0.96 0.00 0.04 0.86 0.04 0.11 0.00 0.88 0.12 0
## Bob 0.92 0.00 0.08 0.81 0.00 0.12 0.08 0.79 0.21 0
## Cache 1.00 0.00 0.00 0.42 0.00 0.50 0.08 0.75 0.25 0
## Egg 1.00 0.00 0.00 0.74 0.00 0.26 0.00 1.00 0.00 0
The allele frequencies of all alleles from the same locus (e.g., A.1, A.2 and A.3) should sum to 1 for each population. With eight loci, the row sums should thus add to 8.
apply(Freq, MARGIN = 1, FUN = sum) # Just checking
## Airplane Bachelor BarkingFox Bob Cache Egg Frog
## 8 8 8 8 8 8 8
## GentianL ParagonL Pothole ShipIsland Skyhigh
## 8 8 8 8 8
In this week of Landscape genetics we will familiarize ourselves with common genetic diversity analyses using the Kentucky Arrow Darter set used in Blanton et al 2019.
First we will load in the neccesary packages (if not loaded in above):
library(adegenet)
library(gstudio)
library(LandGenCourse)
library(tibble)
library(here)
library(vcfR)
library(pinfsc50)
library(utils)
Next, we will need to load in the Kentucky Arrow Darter genetic data.
kad_data <- read.csv("~/Desktop/landscape_genetics_2022/labs/landgen_lab2_geneticdiversity/kad_dataset.csv")
Kad.genind <- df2genind(X=kad_data[,c(3:13)], sep=":", ncode=NULL, ind.names= kad_data$Useful_ID, loc.names=NULL, pop=kad_data$Population_ID, NA.char="NA", ploidy=2, type="codom", strata=NULL, hierarchy=NULL)
Kad.genind
## /// GENIND OBJECT /////////
##
## // 213 individuals; 11 loci; 117 alleles; size: 138.6 Kb
##
## // Basic content
## @tab: 213 x 117 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 3-31)
## @loc.fac: locus factor for the 117 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: df2genind(X = kad_data[, c(3:13)], sep = ":", ncode = NULL, ind.names = kad_data$Useful_ID,
## loc.names = NULL, pop = kad_data$Population_ID, NA.char = "NA",
## ploidy = 2, type = "codom", strata = NULL, hierarchy = NULL)
##
## // Optional content
## @pop: population of each individual (group size range: 9-31)
The genetic resolution depends on the number of markers and their polymorphism. The table above and the summary function for genind objects together provide this information. Now we run the summary function:
summary(Kad.genind)
##
## // Number of individuals: 213
## // Group sizes: 24 19 26 24 9 27 14 31 18 21
## // Number of alleles per locus: 5 3 10 8 10 10 10 11 31 16 3
## // Number of alleles per group: 60 29 51 25 14 41 41 44 62 58
## // Percentage of missing data: 2.94 %
## // Observed heterozygosity: 0.3 0.04 0.45 0.42 0.63 0.51 0.7 0.49 0.69 0.6 0.22
## // Expected heterozygosity: 0.54 0.19 0.63 0.55 0.83 0.84 0.88 0.82 0.94 0.88 0.48
The output of the summary function shows us the following:
11 loci with 3 - 31 alleles (117 in total) Expected heterozygosity varies between 0.19 and 0.94 There is a small amount of missing values (2.94%)
See the first lab example for explanations of HWE
The default is B = 1000 permutations (set B = 0 to skip this test). Here we use the function ‘round’ with argument ‘digits = 3’ to round all values to 3 decimals.
round(pegas::hw.test(Kad.genind, B = 1000), digits = 3)
## chi^2 df Pr(chi^2 >) Pr.exact
## C2 89.599 10 0 0
## C3 345.456 3 0 0
## C6 323.814 45 0 0
## C117 434.427 28 0 0
## D131 142.252 45 0 0
## C26b 529.005 45 0 0
## Cv12 197.543 45 0 0
## D10 307.827 55 0 0
## D11 1402.630 465 0 0
## D107 591.560 120 0 0
## D116 63.818 3 0 0
Both tests suggest that all loci are out of HWE globally (across all 213 individuals). This is indicated by the 0 values (they would be 1 if they were in HWE)
Next, we check for HWE of each locus in each population.
Notes on the code: The curly brackets ‘{ }’ below are used to keep the output from multiple lines together in the html file. Function ‘seppop’ splits the genind object by population. We use ‘sapply’ to apply the function ‘hw.test’ from package ‘pegas’ to each population (see this week’s video and tutorial). We set ‘B=0’ to specify that we don’t need any permutations right now. The function ‘t’ takes the transpose of the resulting matrix, which means it flips rows and columns. This works on a matrix, not a data frame, hence we use ‘data.matrix’ to temporarily interpret the data frame as a matrix.
HWE.test <- data.frame(sapply(seppop(Kad.genind),
function(ls) pegas::hw.test(ls, B=0)[,3]))
HWE.test.chisq <- t(data.matrix(HWE.test))
{cat("Chi-squared test (p-values):", "\n")
round(HWE.test.chisq,3)}
## Chi-squared test (p-values):
## C2 C3 C6 C117 D131 C26b Cv12 D10 D11 D107 D116
## X1 0.772 1.000 0.285 0.891 0.547 0.889 0.331 0.000 0.959 0.003 0.392
## X2 1.000 0.498 0.243 0.356 0.678 1.000 0.865 0.000 0.000 0.001 1.000
## X3 0.929 1.000 0.980 0.690 0.425 0.430 0.748 0.464 0.452 0.830 0.190
## X4 0.744 1.000 0.744 1.000 0.744 0.484 0.554 0.910 0.741 0.917 1.000
## X5 1.000 0.003 1.000 1.000 0.987 1.000 1.000 1.000 1.000 1.000 1.000
## X6 1.000 0.000 0.640 0.096 0.866 0.048 0.648 0.182 0.488 0.313 0.760
## X7 0.079 1.000 0.033 0.360 0.105 0.110 0.886 0.962 0.030 0.437 0.684
## X8 1.000 0.000 0.000 0.000 0.720 0.655 0.644 0.764 0.927 0.056 1.000
## X9 0.528 0.000 0.017 0.001 0.872 0.637 0.712 0.009 0.552 0.003 0.494
## X10 0.589 0.000 0.007 0.740 0.693 0.873 0.183 0.036 0.000 0.704 0.992
Let’s repeat this with a Monte Carlo permutation test with B = 1000 replicates:
HWE.test <- data.frame(sapply(seppop(Kad.genind),
function(ls) pegas::hw.test(ls, B=1000)[,4]))
HWE.test.MC <- t(data.matrix(HWE.test))
{cat("MC permuation test (p-values):", "\n")
round(HWE.test.MC,3)}
## MC permuation test (p-values):
## C2 C3 C6 C117 D131 C26b Cv12 D10 D11 D107 D116
## X1 0.704 1.000 0.366 1.000 0.423 1.000 0.413 0.000 0.933 0.352 0.706
## X2 1.000 0.637 0.178 0.645 0.638 1.000 1.000 0.021 0.107 0.231 1.000
## X3 0.798 1.000 0.846 0.950 0.078 0.343 0.684 0.144 0.315 0.564 0.199
## X4 1.000 1.000 1.000 1.000 1.000 1.000 0.618 1.000 0.674 1.000 1.000
## X5 1.000 0.064 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## X6 1.000 0.001 0.818 0.229 0.385 0.115 0.707 0.068 0.295 0.208 1.000
## X7 0.036 1.000 0.043 0.575 0.069 0.109 0.920 0.815 0.229 0.253 0.783
## X8 1.000 0.001 0.030 0.004 0.862 1.000 0.522 0.629 0.453 0.003 1.000
## X9 0.559 0.036 0.185 0.008 0.748 0.393 0.495 0.021 0.603 0.148 1.000
## X10 0.643 0.000 0.097 1.000 0.700 0.949 0.045 0.210 0.000 0.862 1.000
To summarize, let’s calculate, for each locus, the proportion of populations where it was out of HWE. Here we’ll use the conservative cut-off of alpha = 0.05 for each test. There are various ways of modifying this, including a simple Bonferroni correction, where we divide alpha by the number of tests, which you can activate here by removing the ## i. front of the line.
We write the results into a data frame ‘Prop.loci.out.of.HWE’ and use ‘=’ to specify the name for each column.
alpha=0.05
Prop.loci.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 2, mean),
MC=apply(HWE.test.MC<alpha, 2, mean))
Prop.loci.out.of.HWE
## Chisq MC
## C2 0.0 0.1
## C3 0.5 0.4
## C6 0.4 0.2
## C117 0.2 0.2
## D131 0.0 0.0
## C26b 0.1 0.0
## Cv12 0.0 0.1
## D10 0.4 0.3
## D11 0.3 0.1
## D107 0.3 0.1
## D116 0.0 0.0
And similarly, for each population, the proportion of loci that were out of HWE:
Prop.pops.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 1, mean),
MC=apply(HWE.test.MC<alpha, 1, mean))
Prop.pops.out.of.HWE
## Chisq MC
## X1 0.18181818 0.09090909
## X2 0.27272727 0.09090909
## X3 0.00000000 0.00000000
## X4 0.00000000 0.00000000
## X5 0.09090909 0.00000000
## X6 0.18181818 0.09090909
## X7 0.18181818 0.18181818
## X8 0.27272727 0.36363636
## X9 0.45454545 0.27272727
## X10 0.36363636 0.27272727
The results suggest that:
While most loci are out of HWE globally, this is largely explained by subdivision (variation in allele frequencies among local populations indicating limited gene flow).
No locus is consistently out of HWE across populations (loci probably not affected by selection).
Some populations are somewhat consistently out of HWE across loci, indicating possible recent major bottlenecks/ founder effects (Populations 9 and 10).
Let’s repeat this with ‘false discovery rate’ correction for the number of tests. Here we use the function ‘p.adjust’ with the argument ‘method=“fdr”’ to adjust the p-values from the previous tests. This returns a vector of length 96 (the number of p-values used), which we convert back into a matrix of 12 rows (pops) by 8 columns (loci). Then we procede as above.
Chisq.fdr <- matrix(p.adjust(HWE.test.chisq,method="fdr"),
nrow=nrow(HWE.test.chisq))
MC.fdr <- matrix(p.adjust(HWE.test.MC, method="fdr"),
nrow=nrow(HWE.test.MC))
Prop.pops.out.of.HWE <- data.frame(Chisq=apply(HWE.test.chisq<alpha, 1, mean),
MC=apply(HWE.test.MC<alpha, 1, mean),
Chisq.fdr=apply(Chisq.fdr<alpha, 1, mean),
MC.fdr=apply(MC.fdr<alpha, 1, mean))
Prop.pops.out.of.HWE
## Chisq MC Chisq.fdr MC.fdr
## X1 0.18181818 0.09090909 0.18181818 0.09090909
## X2 0.27272727 0.09090909 0.27272727 0.00000000
## X3 0.00000000 0.00000000 0.00000000 0.00000000
## X4 0.00000000 0.00000000 0.00000000 0.00000000
## X5 0.09090909 0.00000000 0.09090909 0.00000000
## X6 0.18181818 0.09090909 0.09090909 0.09090909
## X7 0.18181818 0.18181818 0.00000000 0.00000000
## X8 0.27272727 0.36363636 0.27272727 0.09090909
## X9 0.45454545 0.27272727 0.27272727 0.00000000
## X10 0.36363636 0.27272727 0.27272727 0.18181818
After using false discovery rate correction for the tests performed, populations 9 and 10 showed a small amount of combinations of locus and populations were out of HWE based on the chi-squared test and MC test.
For microsatellite markers, we typically don’t know where on the genome they are located. The closer together two markers are on a chromosome, the more likely they are inherited together, which means that they don’t really provide independent information. Testing for linkage disequilibrium assesses this, for each pair of loci, by checking whether alleles of two loci are statistically associated.
This step is especially important when developing a new set of markers. You may want to drop (the less informative) one marker of any pair of linked loci.
Here, we start with performing an overall test of linkage disequilibrium (the null hypothesis is that there is no linkage among the set of markers). Two indices are calculated and tested: an index of association (Ia; Brown et al. 1980) and a measure of correlation (rbarD; Agapow and Burt 2001), which is less biased (see URL above). The number of permutations is specified by ‘sample = 213.’
Overall, there is no statistical significant association among the markers (p-value: prD = 0.0046; also left figure). Recall that the power of a statistical test increases with sample size, and here we have n = 213, hence even a small effect may be statistically significant. Hence we look at effect size, i.e., the actual strength of the pairwise associations (right figure).
poppr::ia(Kad.genind, sample=213)
## Ia p.Ia rbarD p.rD
## 1.154575532 0.004672897 0.117657897 0.004672897
LD.pair <- poppr::pair.ia(Kad.genind)
LD.pair
## Ia rbarD
## C2:C3 0.0850 0.0853
## C2:C6 0.0461 0.0463
## C2:C117 0.0787 0.0788
## C2:D131 0.1148 0.1166
## C2:C26b 0.2626 0.2650
## C2:Cv12 0.1495 0.1496
## C2:D10 0.1711 0.1712
## C2:D11 0.1162 0.1331
## C2:D107 0.0749 0.0750
## C2:D116 0.0589 0.0589
## C3:C6 0.0733 0.0733
## C3:C117 -0.0152 -0.0153
## C3:D131 0.0625 0.0628
## C3:C26b 0.0245 0.0245
## C3:Cv12 0.0630 0.0630
## C3:D10 -0.1169 -0.1170
## C3:D11 0.0039 0.0043
## C3:D107 -0.1126 -0.1136
## C3:D116 0.0884 0.0889
## C6:C117 0.0664 0.0665
## C6:D131 0.1312 0.1318
## C6:C26b 0.1781 0.1783
## C6:Cv12 0.1143 0.1144
## C6:D10 0.1518 0.1521
## C6:D11 0.0781 0.0859
## C6:D107 0.0995 0.1005
## C6:D116 0.0723 0.0728
## C117:D131 0.1422 0.1434
## C117:C26b 0.1136 0.1140
## C117:Cv12 0.1513 0.1513
## C117:D10 0.1390 0.1390
## C117:D11 0.0973 0.1089
## C117:D107 0.1131 0.1137
## C117:D116 0.0186 0.0186
## D131:C26b 0.2578 0.2580
## D131:Cv12 0.2358 0.2381
## D131:D10 0.1697 0.1715
## D131:D11 0.1156 0.1229
## D131:D107 0.1933 0.1985
## D131:D116 0.1302 0.1329
## C26b:Cv12 0.1850 0.1858
## C26b:D10 0.3152 0.3168
## C26b:D11 0.2025 0.2185
## C26b:D107 0.2181 0.2220
## C26b:D116 0.0489 0.0495
## Cv12:D10 0.1811 0.1811
## Cv12:D11 0.0657 0.0737
## Cv12:D107 0.2285 0.2295
## Cv12:D116 0.1802 0.1806
## D10:D11 0.1321 0.1488
## D10:D107 0.2548 0.2557
## D10:D116 0.0912 0.0914
## D11:D107 0.1566 0.1840
## D11:D116 0.0720 0.0835
## D107:D116 0.0545 0.0545
The strongest correlation is around 0.3, for markers C26b and D10.
See above section for more information regarding null alleles
Note: we are turning off warnings here (currently the code throws a warning for each sample, though results don’t seem to be affected).
Null.alleles <- PopGenReport::null.all(Kad.genind)
Null.alleles$homozygotes$probability.obs
## Allele-1 Allele-2 Allele-3 Allele-4 Allele-5 Allele-6 Allele-7 Allele-8
## C2 0.001 0.000 0.000 0.018 0.005 NA NA NA
## C3 0.001 0.000 0.000 NA NA NA NA NA
## C6 0.007 0.003 0.015 0.015 0.007 0.017 0.000 0.000
## C117 0.021 0.491 0.005 0.000 0.000 0.000 0.012 0.002
## D131 0.002 0.002 0.002 0.002 0.020 0.000 0.013 0.081
## C26b 0.000 0.265 0.000 0.042 0.320 0.000 0.000 0.000
## Cv12 0.164 0.000 0.107 0.011 0.375 0.003 0.000 0.006
## D10 0.000 0.001 0.000 0.026 0.000 0.338 0.008 0.000
## D11 0.005 0.028 0.020 0.112 0.006 0.002 0.023 0.042
## D107 0.030 0.005 0.000 0.000 0.001 0.276 0.000 0.004
## D116 0.000 0.000 0.011 NA NA NA NA NA
## Allele-9 Allele-10 Allele-11 Allele-12 Allele-13 Allele-14 Allele-15
## C2 NA NA NA NA NA NA NA
## C3 NA NA NA NA NA NA NA
## C6 0.005 0.005 NA NA NA NA NA
## C117 NA NA NA NA NA NA NA
## D131 0.003 0.003 NA NA NA NA NA
## C26b 0.041 0.000 NA NA NA NA NA
## Cv12 0.194 0.000 NA NA NA NA NA
## D10 0.001 0.000 0.002 NA NA NA NA
## D11 0.001 0.000 0.008 0.039 0.012 0.040 0.008
## D107 0.014 0.002 0.000 0.100 0.000 0.021 0.001
## D116 NA NA NA NA NA NA NA
## Allele-16 Allele-17 Allele-18 Allele-19 Allele-20 Allele-21 Allele-22
## C2 NA NA NA NA NA NA NA
## C3 NA NA NA NA NA NA NA
## C6 NA NA NA NA NA NA NA
## C117 NA NA NA NA NA NA NA
## D131 NA NA NA NA NA NA NA
## C26b NA NA NA NA NA NA NA
## Cv12 NA NA NA NA NA NA NA
## D10 NA NA NA NA NA NA NA
## D11 0.002 0.022 0.029 0 0.068 0.108 0.001
## D107 0.002 NA NA NA NA NA NA
## D116 NA NA NA NA NA NA NA
## Allele-23 Allele-24 Allele-25 Allele-26 Allele-27 Allele-28 Allele-29
## C2 NA NA NA NA NA NA NA
## C3 NA NA NA NA NA NA NA
## C6 NA NA NA NA NA NA NA
## C117 NA NA NA NA NA NA NA
## D131 NA NA NA NA NA NA NA
## C26b NA NA NA NA NA NA NA
## Cv12 NA NA NA NA NA NA NA
## D10 NA NA NA NA NA NA NA
## D11 0 0.002 0 0.005 0.052 0 0.019
## D107 NA NA NA NA NA NA NA
## D116 NA NA NA NA NA NA NA
## Allele-30 Allele-31
## C2 NA NA
## C3 NA NA
## C6 NA NA
## C117 NA NA
## D131 NA NA
## C26b NA NA
## Cv12 NA NA
## D10 NA NA
## D11 0.004 0.002
## D107 NA NA
## D116 NA NA
We will use summary one (discussed in lab example 1) since we have individuals missing both alleles
{cat(" summary1 (Chakraborty et al. 1994):", "\n")
round(Null.alleles$null.allele.freq$summary1,2)}
## summary1 (Chakraborty et al. 1994):
## C2 C3 C6 C117 D131 C26b Cv12 D10 D11 D107 D116
## Observed frequency 0.29 0.66 0.16 0.13 0.14 0.25 0.11 0.25 0.15 0.19 0.37
## Median frequency 0.29 0.67 0.16 0.12 0.13 0.24 0.11 0.25 0.15 0.18 0.36
## 2.5th percentile 0.21 0.49 0.09 0.06 0.09 0.18 0.07 0.20 0.11 0.13 0.27
## 97.5th percentile 0.38 0.85 0.24 0.19 0.19 0.32 0.16 0.31 0.19 0.24 0.48
{cat("summary2 (Brookfield et al. 1996):", "\n")
round(Null.alleles$null.allele.freq$summary2,2)}
## summary2 (Brookfield et al. 1996):
## C2 C3 C6 C117 D131 C26b Cv12 D10 D11 D107 D116
## Observed frequency 0.19 0.14 0.12 0.09 0.12 0.22 0.11 0.22 0.15 0.17 0.21
## Median frequency 0.19 0.14 0.12 0.08 0.12 0.22 0.11 0.22 0.14 0.17 0.21
## 2.5th percentile 0.13 0.09 0.07 0.04 0.08 0.17 0.07 0.17 0.11 0.12 0.16
## 97.5th percentile 0.24 0.20 0.17 0.13 0.17 0.28 0.15 0.27 0.18 0.22 0.26
Spatial genetic structure:
The Kentucky Arrow Darter data used in this lab includes a great deal of genetic structure.
HWE
Therefore, we expect global estimates (i.e., the whole dataset) of He to be out of HWE due to population substructure (HWE assumes panmictic populations). We would also expect data to be out of HWE when analyzing data by populations due to population substructure. We could, then, test HWE and linkage disequilibrium at the population level.
Linkage:
The smaller samples sizes we have can result in deviations from HWE and in linkage disequilibrium as an artifact of sample size and/or breeding structure (if one male is responsible for all breeding, his genes would appear “linked” in offspring genotypes, even though they are not physically linked in the genome).
So, what do we do? We can look for patterns. Are there loci that are consistently out of HWE across samples sites while other loci are not out of HWE suggesting that there are null alleles or other data quality control issues? With the full data set, this was not the case. Are there loci that are consistently linked across different ponds (while others are not), suggesting that they are linked? With the full dataset, this was not the case.
Null alleles:
Missing data can imply null (non-amplifying) alleles. In this case, while there are loci that have higher “drop-out” rates than others, this is more likely due to some loci being more difficult to call. In addition, some of the samples used in this study were smaller fin clips with lower yields of DNA, resulting in incomplete genotypes. While presence of null alleles is a possibility, genetic structure, breeding patterns at low population size and aggressive quality control of genotypes can all explain the results. Finally, with all of the structure in these data, there are examples at population level of unique alleles that are fixed or nearly fixed.
These measures are typically quantified per population.
Both nominal sample size (number of fish sampled) and valid sample size (e.g., for each locus, the number of fish with non-missing genetic data) vary between sites. We would expect the number of alleles found in a population to increase with the number of individuals genotyped.
We can check this by plotting the number of alleles against sample size. Here we create an object ‘Sum’ that contains the summary of the genind object, then we can access its elements by ‘$’ to plot what we need. The function ‘names’ lists the names of the elements, which reduced the guesswork.
Sum <- summary(Kad.genind)
names(Sum)
## [1] "n" "n.by.pop" "loc.n.all" "pop.n.all" "NA.perc" "Hobs"
## [7] "Hexp"
The site names are quite long, hence we print the labels vertically by setting ‘las=3,’ and we modify the margins (‘mar’). The four numbers give the size of each margin in the following order: bottom, left, top, right.
We add a regression line to the scatterplot with the function ‘abline,’ where we specify the linear regression model with the function ‘lm.’ In this case, we model the response ‘pop.n.all’ as a function of predictor ‘n.by.pop.’
The barchart (left) shows that there is considerable variation among populations in the number of alleles observed across all loci. The scatterplot (right) with the red regression line shows that the number of alleles increases with sample size.
par(mar=c(5.5, 4.5,1,1))
barplot(Sum$pop.n.all, las=3,
xlab = "", ylab = "Number of alleles")
plot(Sum$n.by.pop, Sum$pop.n.all,
xlab = "Sample size", ylab = "Number of alleles")
abline(lm(Sum$pop.n.all ~ Sum$n.by.pop), col = "red")
Hence we should not compare the number of alleles directly. Instead, we’ll use rarefied allelic richness (Ar).
By default, the function ‘allel.rich’ finds the lowest valid sample size across all populations and loci, and multiplies it by the ploidy level. The number is stored as ‘Richness$alleles.sampled’ (here: 3 individuals * 2 alleles = 6 alleles). Alternatively, this number can be set with the ‘min.alleles’ argument.
Richness <- PopGenReport::allel.rich(Kad.genind, min.alleles = NULL)
Richness$alleles.sampled
## [1] 10
Populations with more alleles are resampled to determine the average allelic richness among the minimum number of alleles. Here, this means that 10 alleles are sampled from each population, allelic richness is calculated, and the process is repeated many times to determine the average).
Let’s plot the results again. The barchart shows that there is considerable variation in genetic diversity among ponds. The scatterplot against sample size (here: for each population, the average number of valid alleles across loci) suggests that the variation is not as realted to sample size as in the first plot above. The regression line (red) is becoming more horizontal and the range on the y axis has decreased by a lot.
Here we plot the average Ar across loci, so that the result does not depend on the number of loci used.
par(mar=c(5.5, 4.5,1,1))
barplot(Richness$mean.richness, las=3, ylab="Rarefied allelic richness (Ar)")
plot(colMeans(Richness$pop.sizes), Richness$mean.richness,
xlab="Valid sample size",
ylab="Rarefied allelic richness (Ar)")
abline(lm(Richness$mean.richness ~ colMeans(Richness$pop.sizes)), col="red")
Note: Writing the ‘genind’ summary into an object ‘Sum’ allows accessing its attributes by name.
Sum <- summary(Kad.genind)
names(Sum)
## [1] "n" "n.by.pop" "loc.n.all" "pop.n.all" "NA.perc" "Hobs"
## [7] "Hexp"
Expected heterozygosity (here: Hexp) is the heterozygosity expected in a population under HWE, and observed heterozygosity (here: Hobs) is the observed number of heterozygotes at a locus divided by the total number of genotyped individuals. Here are the global values (pooled across all populations):
par(mar=c(3, 4.5,1,1))
barplot(Sum$Hexp, ylim=c(0,1), ylab="Expected heterozygosity")
barplot(Sum$Hobs, ylim=c(0,1), ylab="Observed heterozygosity")
By locus and population:
Here we use ‘seppop’ to split the genind object by population, then ‘sapply’ to apply function ‘summary’ to each population.
Hobs <- t(sapply(seppop(Kad.genind), function(ls) summary(ls)$Hobs))
Hexp <- t(sapply(seppop(Kad.genind), function(ls) summary(ls)$Hexp))
{cat("Expected heterozygosity (Hexp):", "\n")
round(Hexp, 2)
cat("\n", "Observed heterozygosity (Hobs):", "\n")
round(Hobs, 2)}
## Expected heterozygosity (Hexp):
##
## Observed heterozygosity (Hobs):
## C2 C3 C6 C117 D131 C26b Cv12 D10 D11 D107 D116
## 1 0.54 0.00 0.71 0.62 0.71 0.50 0.92 0.29 0.92 0.78 0.58
## 2 0.00 0.42 0.53 0.58 0.42 0.00 0.47 0.00 0.21 0.74 0.00
## 3 0.46 0.00 0.50 0.69 0.73 0.73 0.73 0.42 0.88 0.77 0.23
## 4 0.46 0.00 0.12 0.00 0.46 0.25 0.67 0.54 0.83 0.04 0.00
## 5 0.00 0.00 0.00 0.00 0.22 0.00 0.00 0.00 0.00 0.00 0.00
## 6 0.00 0.00 0.63 0.30 0.52 0.67 0.89 0.54 0.70 0.77 0.11
## 7 0.36 0.00 0.21 0.57 0.64 0.64 0.71 0.75 0.79 0.64 0.57
## 8 0.00 0.00 0.29 0.19 0.90 0.48 0.71 0.63 0.65 0.52 0.00
## 9 0.82 0.00 0.56 0.83 0.61 0.67 0.67 0.78 0.83 0.88 0.28
## 10 0.40 0.00 0.68 0.44 0.81 0.89 0.69 0.87 0.60 0.80 0.56
Population 5 shows little variation in expected heterozygosity across loci
Let’s plot the average across all loci for each population:
Here we use ‘apply’ to apply the function ‘mean’ to the rows (MARGIN = 1). For columns, use ‘MARGIN = 2.’
par(mar=c(5.5, 4.5, 1, 1))
Hobs.pop <- apply(Hobs, MARGIN = 1, FUN = mean)
Hexp.pop <- apply(Hexp, 1, mean)
barplot(Hexp.pop, ylim=c(0,1), las=3, ylab="Expected heterozygosity")
barplot(Hobs.pop, ylim=c(0,1), las=3, ylab="Observed heterozygosity")
Kad.diversity <- data.frame(Pop = names(Hobs.pop),
n = Sum$n.by.pop,
Hobs = Hobs.pop,
Hexp = Hexp.pop,
Ar = Richness$mean.richness)
Kad.diversity
## Pop n Hobs Hexp Ar
## 1 1 24 0.59766140 0.58324965 3.421913
## 2 2 19 0.30622010 0.31453035 1.930518
## 3 3 26 0.55944056 0.56898870 3.259311
## 4 4 24 0.30681818 0.27541035 1.871246
## 5 5 9 0.02020202 0.03647587 1.155860
## 6 6 27 0.46568247 0.46162692 2.727438
## 7 7 14 0.53512397 0.54164116 2.877023
## 8 8 31 0.39762952 0.43281840 2.776163
## 9 9 18 0.62982769 0.60697080 3.713312
## 10 10 21 0.61251176 0.64287164 3.750409
You can save the R object ‘Kad.diversity’ with the code below
##if(!dir.exists(paste0(here(),"/output"))) dir.create(paste0(here(),"/output"))
#save(Kad.diversity, file = paste0(here(),"/output/Kad.diversity.RData"))
#load(paste0(here(),"/output/Kad.diversity.RData"))
For some analyses, we will need to aggregate data from the individual to the population level, e.g. as a table of allele frequencies per population.
Here we convert the ‘genind’ object to a ‘genpop’ object (NOT the same as a ‘genepop’ object!). This is defined in the package ‘adegenet’ to hold population-level genetic data. The function ‘genind2genpop’ obviously converts from ‘genind’ to ‘genpop.’
Kad.genpop <- adegenet::genind2genpop(Kad.genind)
##
## Converting data from a genind to a genpop object...
##
## ...done.
The function ‘makefreq’ extracts the table with allele frequencies from the ‘genpop’ object. We’ll plot just a few lines and alleles.
Freq <- adegenet::makefreq(Kad.genpop)
##
## Finding allelic frequencies from a genpop object...
##
## ...done.
round(Freq[1:6,1:10], 2)
## C2.141 C2.145 C2.133 C2.129 C2.148 C3.214 C3.210 C3.212 C6.272 C6.280
## 1 0.38 0.56 0.06 0 0 1.00 0.00 0 0.40 0.42
## 2 0.00 1.00 0.00 0 0 0.53 0.47 0 0.68 0.00
## 3 0.67 0.29 0.04 0 0 1.00 0.00 0 0.42 0.00
## 4 0.31 0.00 0.69 0 0 1.00 0.00 0 0.06 0.00
## 5 1.00 0.00 0.00 0 0 0.89 0.11 0 0.00 0.00
## 6 1.00 0.00 0.00 0 0 0.89 0.11 0 0.19 0.00
The allele frequencies of all alleles from the same locus (e.g., C2.141) should sum to 1 for each population. With 11 loci, the row sums should thus add to 11.
apply(Freq, MARGIN = 1, FUN = sum) # Just checking
## 1 2 3 4 5 6 7 8 9 10
## 11 11 11 11 11 11 11 11 11 11